Lab 2: Julia Quickstart

Functions, Logic, and Packages

Lab
Author

Leanh Nguyen CEVE 421

Published

Fri., Jan. 19

First steps

We start by loading the packages we will use in this lab.

using CSV
using DataFrames
using DataFramesMeta
using Dates
using Plots
using StatsBase: mean
using StatsPlots
using Unitful

Defining a function

In index.qmd, we read in a CSV file from scratch. However, we’d like to repeat this process for each year of data, and to do it in a consistent way so that we can combine the data for all available years into a single DataFrame. To do this, we’ll write a function that reads in the data for any year: it takes the year as an argument and returns a DataFrame with the data for that year.

Before we do that, let’s define a function that will return the filename for a given year. It’s often valuable to build a larger function out of several small ones like this.

get_fname(year::Int) = "data/tidesandcurrents-8638610-$(year)-NAVD-GMT-metric.csv"
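As a quick check of the string interpolation, the helper can be called on its own. The snippet below repeats the one-line definition so that it runs standalone; it only builds the path string and does not touch the file system.

```julia
# Same one-line definition as above, repeated so this snippet is self-contained
get_fname(year::Int) = "data/tidesandcurrents-8638610-$(year)-NAVD-GMT-metric.csv"

# $(year) is replaced by the value of the argument
fname_1928 = get_fname(1928)
println(fname_1928)

# Because of the ::Int annotation, passing a String has no matching method
is_method_error = try
    get_fname("1928")
    false
catch err
    err isa MethodError
end
println(is_method_error)
```

The type annotation is a design choice: a mistyped argument fails early instead of silently producing a wrong filename.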

Now we’re ready to define our function:

function read_tides(year::Int)

    # define the CSV file corresponding to our year of choice
    fname = get_fname(year)

    # a constant, don't change this
    date_format = "yyyy-mm-dd HH:MM"

    # 1-2. read in the CSV file once as a DataFrame; the dateformat keyword
    # parses the "Date Time" column into DateTime objects during the read
    df = CSV.read(fname, DataFrame; dateformat=date_format)

    # 3. attach units of meters to the " Water Level" column
    df[!, " Water Level"] .*= 1u"m"

    # 4. rename the columns to "datetime" and "lsl"
    df = @rename(df, :datetime = $"Date Time", :lsl = $" Water Level")

    # 5. select the "datetime" and "lsl" columns
    df = @select(df, :datetime, :lsl)

    # 6. return the dataframe
    return df

end

# print out the first 10 rows of the 1928 data

first(read_tides(1928), 10) 
10×2 DataFrame
 Row │ datetime             lsl
     │ DateTime             Quantity…
─────┼────────────────────────────────
   1 │ 1928-01-01T00:00:00  -0.547 m
   2 │ 1928-01-01T01:00:00  -0.699 m
   3 │ 1928-01-01T02:00:00  -0.73 m
   4 │ 1928-01-01T03:00:00  -0.669 m
   5 │ 1928-01-01T04:00:00  -0.516 m
   6 │ 1928-01-01T05:00:00  -0.364 m
   7 │ 1928-01-01T06:00:00  -0.212 m
   8 │ 1928-01-01T07:00:00  -0.059 m
   9 │ 1928-01-01T08:00:00  -0.029 m
  10 │ 1928-01-01T09:00:00  -0.029 m

Instructions

Fill out this function. Your function should implement the six steps indicated in the instructions. Use the example code from index.qmd to help you. When it’s done, convert it to a live code block by replacing ```julia``` with ```{julia}```. When you run this code, it should print out the first 10 rows of the 1928 data. Make sure they look right!
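Step 2 depends on the format string matching the timestamps in the file. A minimal sketch of how that format string behaves, using only the Dates standard library; the sample timestamp is an assumed example of the layout, not a value copied from the CSV.

```julia
using Dates

# Same format string as in read_tides: lowercase mm is the month, uppercase MM is minutes
fmt = DateFormat("yyyy-mm-dd HH:MM")

# A sample timestamp string (assumed to match the layout of the "Date Time" column)
raw = "1928-01-01 13:00"
t = DateTime(raw, fmt)

# Once parsed, components and date arithmetic are available
println(t)
println(hour(t))
println(t + Hour(1))
```

Mixing up `mm` and `MM` is an easy mistake to make, so it is worth parsing one string by hand before reading a whole file.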

Building the dataset

Now that we have the ability to read in the data corresponding to any year, we can read them all in and combine into a single DataFrame. First, let’s read in all the data.

Instructions
  1. Hint: to vectorize a function means to apply it to each element of a vector. For example, f.(x) will apply the function f to each element of the vector x. This is a very common operation in Julia!
  2. Update the code blocks below, then replace ```julia``` with ```{julia}```.
years = 1928:2021 # all the years of data
annual_data = read_tides.(years) # call the read_tides function on each year (see hint above!)
typeof(annual_data) # should be a vector of DataFrames
Vector{DataFrame} (alias for Array{DataFrame, 1})
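The broadcasting hint can be tried on a toy function before applying it to read_tides. In this sketch, `double` is a hypothetical stand-in used only for illustration.

```julia
# A toy scalar function (hypothetical, for illustration only)
double(x) = 2x

xs = [1, 2, 3]

# The dot applies `double` to each element and collects the results
ys = double.(xs)

# Equivalent spellings of the same operation
ys_map = map(double, xs)
ys_comp = [double(x) for x in xs]

# Broadcasting a user-defined function over a range also returns a Vector,
# which is why read_tides.(years) gives a Vector{DataFrame}
zs = double.(1:3)
println(ys)
println(typeof(zs))
```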

Next, we’ll use the vcat function to combine all the data into a single DataFrame.

df = vcat(annual_data...)
first(df, 5)
5×2 DataFrame
 Row │ datetime             lsl
     │ DateTime             Quantity…?
─────┼─────────────────────────────────
   1 │ 1928-01-01T00:00:00  -0.547 m
   2 │ 1928-01-01T01:00:00  -0.699 m
   3 │ 1928-01-01T02:00:00  -0.73 m
   4 │ 1928-01-01T03:00:00  -0.669 m
   5 │ 1928-01-01T04:00:00  -0.516 m
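The three dots in `vcat(annual_data...)` are the splat operator: they pass each element of the vector as a separate argument. A small sketch with plain vectors standing in for the annual DataFrames:

```julia
# Toy stand-in for annual_data: a vector of vectors
chunks = [[1, 2], [3], [4, 5]]

# Splatting: vcat(chunks...) is the same as vcat(chunks[1], chunks[2], chunks[3])
combined = vcat(chunks...)

# reduce(vcat, ...) gives the same result without splatting
combined_reduce = reduce(vcat, chunks)

println(combined)
```

For long collections, `reduce(vcat, annual_data)` is a commonly used alternative to splatting; both should produce the same combined result.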

And we can look at the last 5 rows:

last(df, 5)
5×2 DataFrame
 Row │ datetime             lsl
     │ DateTime             Quantity…?
─────┼─────────────────────────────────
   1 │ 2021-12-31T19:00:00  -0.147 m
   2 │ 2021-12-31T20:00:00  -0.044 m
   3 │ 2021-12-31T21:00:00  0.152 m
   4 │ 2021-12-31T22:00:00  0.353 m
   5 │ 2021-12-31T23:00:00  0.508 m

Finally, we’ll make sure we drop any missing data.

dropmissing!(df) # drop any missing data
808761×2 DataFrame
    Row │ datetime             lsl
        │ DateTime             Quantity…
────────┼────────────────────────────────
      1 │ 1928-01-01T00:00:00  -0.547 m
      2 │ 1928-01-01T01:00:00  -0.699 m
      3 │ 1928-01-01T02:00:00  -0.73 m
      4 │ 1928-01-01T03:00:00  -0.669 m
      5 │ 1928-01-01T04:00:00  -0.516 m
      6 │ 1928-01-01T05:00:00  -0.364 m
      7 │ 1928-01-01T06:00:00  -0.212 m
      8 │ 1928-01-01T07:00:00  -0.059 m
      9 │ 1928-01-01T08:00:00  -0.029 m
     10 │ 1928-01-01T09:00:00  -0.029 m
     11 │ 1928-01-01T10:00:00  -0.151 m
     12 │ 1928-01-01T11:00:00  -0.303 m
     13 │ 1928-01-01T12:00:00  -0.486 m
   ⋮    │          ⋮               ⋮
 808750 │ 2021-12-31T12:00:00  0.74 m
 808751 │ 2021-12-31T13:00:00  0.66 m
 808752 │ 2021-12-31T14:00:00  0.486 m
 808753 │ 2021-12-31T15:00:00  0.256 m
 808754 │ 2021-12-31T16:00:00  0.024 m
 808755 │ 2021-12-31T17:00:00  -0.141 m
 808756 │ 2021-12-31T18:00:00  -0.203 m
 808757 │ 2021-12-31T19:00:00  -0.147 m
 808758 │ 2021-12-31T20:00:00  -0.044 m
 808759 │ 2021-12-31T21:00:00  0.152 m
 808760 │ 2021-12-31T22:00:00  0.353 m
 808761 │ 2021-12-31T23:00:00  0.508 m
                      808736 rows omitted
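The effect of `dropmissing!` is easier to see on a tiny table. This sketch uses made-up values and integer stand-ins for the timestamps; it shows that rows containing `missing` are removed in place and that the column type no longer allows missing values, which is why the `?` disappears from the column header in the output above.

```julia
using DataFrames

# Toy table with two missing water levels (made-up values)
df_toy = DataFrame(
    datetime = [1, 2, 3, 4],
    lsl = [0.5, missing, -0.2, missing],
)

eltype_before = eltype(df_toy.lsl)

# The trailing ! means df_toy itself is modified
dropmissing!(df_toy)

eltype_after = eltype(df_toy.lsl)

println(nrow(df_toy))
println(eltype_before, " -> ", eltype_after)
```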

Plots

  1. Plot the hourly water levels for March 2020, using subsetting and plotting techniques from the instructions
plot(
    df.datetime,
    df.lsl;
    title="Hourly Water levels at Sewells Point, VA",
    ylabel="Water level",
    label=false,
)
plot(
    df_month.datetime,
    df_month.lsl;
    title="March 2020 Water levels at Sewells Point, VA",
    ylabel="Water level",
    label=false,
)
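The second plot relies on a `df_month` subset whose construction is not shown above. A minimal sketch of one way to build it, assuming a half-open interval from March 1 to April 1; the toy table `df_toy` and its zero-valued `lsl` column are placeholders, and the original lab may have used a DataFramesMeta macro instead of boolean indexing.

```julia
using DataFrames
using Dates

# Toy hourly series spanning late February to early April 2020 (placeholder values)
times = collect(DateTime(2020, 2, 28):Hour(1):DateTime(2020, 4, 2))
df_toy = DataFrame(datetime = times, lsl = zeros(length(times)))

# Half-open interval: include March 1 00:00, exclude April 1 00:00
t_start = DateTime(2020, 3, 1)
t_end = DateTime(2020, 4, 1)
in_march = t_start .<= df_toy.datetime .< t_end

# Keep only the rows that fall inside March 2020
df_month = df_toy[in_march, :]

println(nrow(df_month))
```

Using `.<` on the upper bound avoids having to spell out the last hour of the month.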
  2. In the instructions, we plotted the average monthly water level from each month using groupby. Repeat this analysis, using the full dataset (all years).
plot(
    df_climatology.month,
    df_climatology.lsl_avg;
    xticks=1:12,
    xlabel="Month",
    ylabel="Average Water level",
    title="Average Monthly Water Level",
    linewidth=3,
    label=false,
)
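The plot above assumes a `df_climatology` table with `month` and `lsl_avg` columns. A sketch of how such a table might be computed with groupby and combine, on a toy dataset with made-up values; it uses `mean` from the Statistics standard library, and the column names follow the plotting code.

```julia
using DataFrames
using Dates
using Statistics: mean

# Toy data: two Januaries and two Februaries (made-up values)
df_toy = DataFrame(
    datetime = [
        DateTime(2020, 1, 1), DateTime(2021, 1, 1),
        DateTime(2020, 2, 1), DateTime(2021, 2, 1),
    ],
    lsl = [1.0, 3.0, 2.0, 4.0],
)

# Extract the calendar month from each timestamp
df_toy.month = month.(df_toy.datetime)

# Average the water level within each month, pooling all years
df_climatology = combine(groupby(df_toy, :month), :lsl => mean => :lsl_avg)
sort!(df_climatology, :month)

println(df_climatology)
```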
  3. Now repeat the analysis, but group by day of the year. What do you notice? (Hint: use Dates.dayofyear to get the day of the year from a DateTime object)
plot(
    df_climatology.day,
    df_climatology.lsl_avg;
    tickfontsize=5,
    xticks=0:40:365,
    xlabel="Day",
    ylabel="Average Water level",
    title="Average Daily Water Level",
    linewidth=3,
    label=false,
)
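For the day-of-year version, the grouping column comes from `Dates.dayofyear`. The sketch below uses a toy dataset with made-up values to show the same groupby pattern and one subtlety: in a leap year, every date after February 29 is shifted by one day, so a given day number does not always correspond to the same calendar date.

```julia
using DataFrames
using Dates
using Statistics: mean

# Toy data around March 1 in a leap year (2020) and a non-leap year (2021)
df_toy = DataFrame(
    datetime = [DateTime(2020, 3, 1), DateTime(2021, 3, 1), DateTime(2021, 3, 2)],
    lsl = [1.0, 2.0, 4.0],
)

# March 1 is day 61 in 2020 but day 60 in 2021
df_toy.day = dayofyear.(df_toy.datetime)

# Average within each day of the year, pooling all years
df_climatology = combine(groupby(df_toy, :day), :lsl => mean => :lsl_avg)
sort!(df_climatology, :day)

println(df_climatology)
```

Here day 61 pools March 1, 2020 with March 2, 2021. Day 366 exists only in leap years, so its average rests on fewer observations than the other days.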

I noticed that both graphs have a similar overall shape, with peaks around June and September. The main difference is that the day-of-year graph is much noisier: it has many more points (up to 366 rather than 12), and each one averages fewer observations than a monthly mean does, so it shows sharp rises and drops. The monthly graph averages over this day-to-day variability and gives a relatively smooth line.

scatter( # for testing purposes
    df_climatology.day,
    df_climatology.lsl_avg;
    tickfontsize=5,
    xticks=0:40:365,
    xlabel="Day",
    ylabel="Average Water level",
    title="Average Daily Water Level - Scatter Plot",
    linewidth=3,
    label=false,
)